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EFFICIENT DISCONTINUOUS GALERKIN FINITE ELEMENT METHODS VIA 

BERNSTEIN POLYNOMIALS 

ROBERT C. KIRBY t 

Abstract. We consider the discontinuous Galerkin method for hyperbolic conservation laws, with some particu¬ 
lar attention to the linear acoustic equation, using Bernstein polynomials as local bases. Adapting existing techniques 
leads to optimal-complexity computation of the element and boundary flux terms. The element mass matrix, how¬ 
ever, requires special care. In particular, we give an explicit formula for its eigenvalues and exact characterization of 
the eigenspaces in terms of the Bernstein representation of orthogonal polynomials. We also show a fast algorithm 
for solving linear systems involving the element mass matrix to preserve the overall complexity of the DG method. 
Finally, we present numerical results investigating the accuracy of the mass inversion algorithms and the scaling of 
total run-time for the function evaluation needed in DG time-stepping. 
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1. Introduction. Bernstein polynomials, which are “geometrically decomposed” in the sense 
of [5] and rotationally symmetric, provide a flexible and general-purpose set of simplicial finite 
element shape functions. Morever, recent research has demonstrated distinct algorithmic advantages 
over other simplicial shape functions, as many essential elementwise finite element computations can 
be performed on with optimal complexity using Bernstein polynomials In [18] , we showed how, with 
constant coefficients, elementwise mass and stiffness matrices could each be applied to vectors in 
operations, where n is the degree of the local basis and d is the spatial dimension. Similar 
blockwise linear algebraic structure enabled quadrature-based algorithms in [5D] . Around the same 
time, Ainsworth et al ^ showed that the Duffy transform reveals a tensorial structure in the 
Bernstein basis itself, leading to sum-factored algorithms for polynomial evaluation and moment 
computation. Moreover, they provide an algorithm that assembles element matrices with 0{1) 
work per entry that utilizes their fast moment algorithm together with a very special property of 
the Bernstein polynomials. Work in mus] extends these techniques to R(div) and R(curl). 

In this paper, we consider Bernstein polynomial techniques in a different context - discontinuous 
Galerkin methods for hyperbolic conservation laws 

gt+V-F(g)=0, (1.1) 

posed on a domain D x [0, T) C x K, together with suitable initial and boundary conditions. As 
a particular example, we consider the linear acoustic model 


Pt + V • M = 0, 
Ut + Vp = 0, 


( 1 . 2 ) 


Here, q = [u,p]’^ where the pressure variable p is a scalar-valued function on D x [0,T] and the 
velocity u maps the same space-time domain into 

Discontinuous Galerkin (DG) methods for such problems place finite volume methods in a 
variational framework and extend them to higher orders of polynomial approximation [5] , but fully 
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realizing the potential efficiencies of high-order methods requires careful consideration of algorithmic 
issues. Simplicial orthogonal polynomials 0113 provide one existing mechanism for achieving 
low operation counts. Their orthogonality gives diagonal local mass matrices. Optimality then 
requires special quadrature that reflects the tensorial nature of the basis under the Duffy transform 
or collapsed-coordinate mapping from the d-simplex to the d-cube and also includes appropriate 
points to incorporate contributions from both volume and boundary flux terms. Hesthaven and 
Warburton [HE] propose an alternate approach, using dense linear algebra in conjunction with 
Lagrange polynomials. While of greater algorithmic complexity, highly-tuned matrix multiplication 
can make this approach competitive or even superior at practical polynomial orders. Additional 
extensions of this idea include the so-called “strong DG” forms and also a pre-elimination of the 
elementwise mass matrix giving rise to a simple ODE system. With care, this approach can give 
very high performance on both CPU and GPU systems jH] . 

In this paper, we will show how each term in the DG formulation with Bernstein polynomials as 
the local basis can be handled with optimal complexity For the element and boundary flux terms, 
this requires only an adaptation of existing techniques, but inverting the element mass matrix turns 
out to be a challenge lest it dominate the complexity of the entire process. We rely on the recursive 
block structure described in [T5] to give an algorithm for solving linear systems with the 

constant-coefficient mass matrix. We may view our approach as sharing certain important features 
of both collapsed-coordinate and Lagrange bases. Like collapsed-coordinate methods, we seek to 
use specialized structure to optimize algorithmic complexity. Like Lagrange polynomials, we seek 
to do this using a relatively discretization-neutral basis. 

2. Discontinuous Galerkin methods. We let 7^ be a triangulation of U in the sense of |5] 
into affine simplices. For curved-sided elements, we could adapt the techniques of |32j to incorporate 
the Jacobian into our local basis functions to recover the reference mass matrix on each cell at the 
expense of having variable coefficients in other operators, but this does not affect the overall order 
of complexity. We let £h denote the set of all edges in the triangulation. 

For T G Th, Pn{T) be the space of polynomials of degree no greater than n on T. This is a 
vector space of dimension ■ We define the global finite element space 

Vh = {f:n^R:f\TGPu{T),TGrh}, ( 2 . 1 ) 


with no continuity enforced between cells. Let (•, ■)j, denote the standard inner product over 
T G Th and the L 2 inner product over an edge 'j G £h- 

After multiplying (1.1) by a test function and integating by parts elementwise, a DG method 
seeks Uh in Vt such that 


[{uh,t,Vh)rp-{F{uh),'^Vh)rp] + '^{F ■n,Vh) (2.2) 

T&Th l^£h 

for all Vh GVh- 

Fully specifying the DG method requires defining a numerical flux function F on each 7 . On 
internal edges, it takes values from either side of the edge and produces a suitable approximation 
to the flux F. Many Riemann solvers from the finite volume literature have been adapted for DG 
methods miniisni- The particular choice of numerical flux does not matter for our purposes. On 
external edges, we choose F to appropriately enforce boundary conditions. 

This discretization gives rise to a system of ordinary differential equations 


Mut + F(u) = 0, 


(2.3) 
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where M is the block-diagonal mass matrix and F(u) includes the cell and boundary flux terms. 
Because of the hyperbolic nature of the system, explicit methods are frequently preferred. A forward 
Euler method, for example, gives 

u”+i = u” - AtM-iF(u") = u" - AtL(u”), (2.4) 

which requires the application of M at each time step. The SSP methods [n mi give stable 
higher-order in time methods. For example, the well-known third order scheme is 

= u" -h AtL(u”), 

u-2 = 3^»+lu"’i + iAtL(u-i), (2.5) 

u"+i = iu” + ^u”’2 + 

o o o 

Since the Bernstein polynomials give a dense element mass matrix, applying efficiently 
will require some care. It turns out that M possesses many fascinating properties that we shall 
survey in Section]^ Among these, we will give an algorithm for applying the elementwise 

inverse. 

DG methods yield reasonable solutions to acoustic or Maxwell’s equations without slope lim¬ 
iters, although most nonlinear problems will require them to suppress oscillations. Even linear 
transport can require limiting when a discrete maximum principle is required. Limiting high-order 
polynomials on simplicial domains remains quite a challenge. It may be possible to utilize properties 
of the Bernstein polynomials to design new limiters or conveniently implement existing ones. For 
example, the convex hull property (i.e. that polynomials in the Bernstein basis lie in the convex 
hull of their control points) gives sufficient conditions for enforcing extremal bounds. We will not 
offer further contributions in this direction, but refer the reader to other works on higher order 
limiting such as UHl 1331133 ] ■ 

3. Bernstein-basis finite element algorithms. 

3.1. Notation for Bernstein polynomials. We formulate Bernstein polynomials on the 
d-simplex using barycentric coordinates and multiindex notation. For a nondegenerate simplex 
T C K'’* with vertices let {6i}f=o denote the barycentric coordinates. Each bi affinely maps 

into K with hi{xj) = Sij for 0 < i,j < d. It follows that bi(x) > 0 for all x € T. 

We will use common multiindex notation, denoting multiindices with Greek letters, although we 
will begin the indexing with 0 rather than 1. So, a = (oq; cni,..., ad) is a tuple of d-l-1 nonnegative 
integers. We define the order of a multiindex a by |a| = J2i=o^i- ^dat a > j3 provided 

that the inequality Ui > Pi holds componentwise for 0 < i < d. Factorials and binomial coefficients 
over multiindices have implied multiplication. That is, 

d 

a\ = Y[a^\ 

2=0 


and, provided that a > /3, 
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Without ambiguity of notation, we also define a binomial coefficient with a whole number for the 
upper argument and and multiindex as lower by 

/ n\ n\ n\ 

W a! 

We also define to be the multiindex consisting of zeros in all but the entry, where it is one. 

Let b = (6 oj & 2 ) • ■ •) bd) be a tuple of barycentric coordinates on a simplex. For multiindex a, 
we define a barycentric monomial by 


b“=n^“ 


2 = 0 


We obtain the Bernstein polynomials by scaling these by certain binomial coefficients 


ji\ 

Bl = -b“. 
a! 


(3.1) 


For all spatial dimensions and degrees n, the Bernstein polynomials of degree n 


form a nonnegative partition of unity and a basis for the vector space of polynomials of degree 
n. They are suitable for assembly in a (7° fashion or even into smoother splines [23]. While DG 
methods do not require assembly, the geometric decomposition does make handling the boundary 
terms straightforward. 

Crucial to fast algorithms using the Bernstein basis, as originally applied to (7° elements [TIITB]. 
is the sparsity of differentiation. That is, it takes no more than d+1 Bernstein polynomials of degree 
n — 1 to represent the derivative of a Bernstein polynomial of degree n. 

For some coordinate direction s, we use the general product rule to write 


d 


dBl: 


ds ds 




n! 

a! 


2 = 0 


db. 


ds 




with the understanding that a term in the sum vanishes if ai = 0. This can readily be rewritten as 

d 


ds 


2 = 0 


^ Qs 


(3.2) 


again with the terms vanishing if any ai = 0, so that the derivative of each Bernstein polynomial 
is a short linear combination of lower-degree Bernstein polynomials. 

Iterating over spatial directions, the gradient of each Bernstein polynomial can be written as 


d 

VBl=nY,Bl-_lyh. (3.3) 

2 = 0 

Note that each Vhi is a fixed vector in for a given simplex T. In US], we provide a data 
structure called a pattern for representing gradients as well as exterior calculus basis functions. For 
implementation details, we refer the reader back to m- 
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The degree elevation operator will also play a crucial role in our algorithms. This operator 
expresses a B-form polynomial of degree n — 1 as a degree n polynomial in B-form. For the 
orthogonal and hierarchical bases in this operation would be trivial - appending the requisite 
number of zeros in a vector, while for Lagrange bases it is typically quite dense. Whiel not trivial, 
degree elevation for Bernstein polynomials is still efficient. Take any Bernstein polynomial and 
multiply it by bi = 1 to hnd 








^ ~ 1)! , g + ej ^ + 1 n! . g+ej 

^ a! ^ n (a + e,)! 

2 = 0 2=0 


= E 


Cti + 1 


B 


a+Ci ■ 


2 = 0 


(3.4) 


We could encode this operation as a x Pn_i matrix consisting of exactly d+ \ nonzero entries, 
but it can also be applied with a simple nested loop. At any rate, we denote this linear operator 
as where n is the degree of the resulting polynomial. We also denote operation 

that successively raises a polynomial from degree ni into n 2 - This is just the product of n 2 — ni 
(sparse) operators: 


^d,ni,n2 _ ^d,n2 ^d,ni + l 


(3.5) 


We have that as a special case. 


3.2. Stroud conical rules and the Duffy transform. The Duffy transform tensorializes 
the Bernstein polynomials, so sum factorization can be used for evaluating and integrating these 
polynomials with Stroud conical quadrature. We used similar quadrature rules in our own work 
on Bernstein-Vandermonde-Gauss matrices ED], but the connection to the Duffy transform and 
decomposition of Bernstein polynomials was quite cleanly presented by Ainsworth et aZ in [1] . 

The Duffy transform maps any point t = (ti, ^ 2 , ■ • ■, tn) in the d-cube [0,1]" into the barycentric 
coordinates for a d-simplex by first defining 


and then inductively by 


for 1 < i < d — 1, and then finally 


Aq — ti 


Ai — ti-\-i I 1 ^ ( Aj 


d-i 

a, = i-Ea,- 

3=0 


If a simplex T has vertices {xi}(Lg, then the mapping 


x(t) = ExiAi(t) 


i=0 


(3.6) 


(3.7) 


(3.8) 


(3.9) 
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maps the unit d-cube onto T. 

This mapping can be used to write integrals over T as iterated weighted integrals over [0,1]^^ 

J /(x)dx=-!^y J dt2{l - ■ J dttf{x{t)). (3.10) 

The Stroud conical rule |29j is based on this observation and consists of tensor products of certain 
Gauss-Jacobi quadrature weights in each ti variable, where the weights are chosen to absorb the 
factors of (1 — These rules play an important role in the collapsed-coordinate framework 

of |17| among many other places. 

As proven in [T], pulling the Bernstein basis back to [0,1]"^ under the Duffy transform reveals a 
tensor-like structure. It is shown that with Bf{t) = the one-dimensional Bernstein 

polynomial, that 

B2(x(t)) = B2„{h)B2ro{t2) ■ ■ ■ (3.11) 

This is a “ragged” rather than true tensor product, much as the collapsed coordinate simplicial 
bases m, but entirely sufficient to enable sum-factored algorithms. 

3.3. Basic algorithms. The Stroud conical rule and tensorialization of Bernstein polynomials 
under the Duffy transformation lead to highly efficient algorithms for evaluating B-form polynomials 
and approximating moments of functions against sets of Bernstein polynomials. 

Three algorithms based on this decomposition turns out to be fundamental for optimal assembly 
and application of Bernstein-basis bilinear forms. First, any polynomial u(x) — 
may be evaluated at the Stroud conical points in operations. In [501, result is pre¬ 

sented as exploiting certain block structure in the matrix tabulating the Bernstein polynomials at 
quadrature points. In [1], it is done by explicitly factoring the sums. 

Second, given some function /(x) tabulated at the Stroud points, it is possible to approximate 
the set of Bernstein moments 


M”(/)=y/(x)B2dx 


for all |a| = n via Stroud quadrature in operations. In the the case where / is constant on 

T, we may also use the algorithm for applying a mass matrix in [I8j to bypass numerical integration. 

Finally, it is shown in [T] that the moment calculation can be adapted to the evaluation of 
element mass and hence stiffness and convection matrices utilizing another remarkable property of 
the Bernstein polynomials. Namely, the product of two Bernstein polynomials of any degrees is, up 
to scaling, a Bernstein polynomial of higher degree: 


fa+^\ 

Tjni p’T’2 _ \ a / p'^i+'^2 

~ ^ni+n2^^a-\-l3 ’ 


(3.12) 


Also, the first two algorithms described above for evaluation and moment calculations demon¬ 
strate that M may be applied to a vector without explicitly forming its entries in only 
entries. In US], we show how to adapt these algorithms to short linear combinations of Bernstein 
polynomials so that stiffness and convection matrices require the same order of complexity as the 


mass. 
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3.4. Application to DG methods. As part of each explicit time stepping stage, we must 
evaluate M“^F(u). Evaluating F(u) requires handling the two flux terms in (2.2). To handle 


{F{uh),'Vvh) 


T ’ 


we simply evaluate Uh at the Stroud points on T, which requires operations. Then, 

evaluating F at each of these points is purely pointwise and so requires but 0{n‘^). Finally, the 
moments against gradients of Bernstein polynomials also requires operations. This term, 

then, is readily handled by existing Bernstein polynomial techniques. 

Second, we must address, on each interface 7 G f, 


{F ■ n,Vh)^. 


The numerical flux F ■ n requires the values of Uh on each side of the interface and is evaluated 
pointwise at each facet quadrature point. Because of the Bernstein polynomials’ geometric decom¬ 
position, only basis functions are nonzero on that facet, and their traces are in fact exactly 

the Bernstein polynomials on the facet. So we have to evaluate two polynomials (the traces from 
each side) of degree n in d — 1 variables at the facet Stroud points. This requires 0{n‘^) operations. 
The numerical flux is computed pointwise at the points, and then the moment integration 

is performed on facets for an overall cost of 0{n‘^) for the facet flux term. In fact, the geometric 
decomposition makes this term much easier to handle optimally with Bernstein polynomials than 
collapsed-coordinate bases, although though specially adapted Radau-like quadrature rules, the 
boundary sums may be lifted into the volumetric integration m- 

The mass matrix, on the other hand, presents a much deeper challenge for Bernstein polynomials 
than for collapsed-coordinate ones. Since it is dense with 0{n‘^) rows and columns, a standard 
matrix Cholesky decomposition requires operations as a startup cost, followed by a pair of 

triangular solves on each solve at each. For d > 1, this complexity clearly dominates the 

steps above, although an optimized Cholesky routine might very well win at practical orders. In 
the next section, we turn to a careful study of the mass matrix, deriving an algorithm of optimal 
complexity. 

4. The Bernstein mass matrix. We begin by defining the rectangular Bernstein mass ma¬ 
trix on a d-simplex T by 


(4.1) 

where m,n > 0. 

By a change of variables, we can write 

^^d,m.n|y|d!, (4 2) 

where is the mass matrix on the unit right simplex Sd in d-space and |T| is the d-dimensional 

measure of T. When m = n, we suppress the third superscript and write M'^’ or We include 

the more general case of a rectangular matrix because such will appear later in our discussion of 
the block structure. 

This mass matrix has many beautiful properties. Besides the block-recursive structure devel¬ 
oped in |18] , it is related to the Bernstein-Durrmeyer operator [HE] of approximation theory. Via 
this connection, we provide an exact characterization of its eigenvalues and associated eigenspaces in 
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the square case m = n. Finally, and most pertinent to the case of discontinuous Galerkin methods, 
we describe algorithms for solving linear systems involving the mass matrix. 

Before proceeding, we recall from [18] that, formulae for integrals of products of powers of 
barycentric coordinates, the mass matrix formula is exactly 


m; 


d, 7 n,n _ m\n\{a + l3)l 


{m + n + d)!a!/3! 

4.1. Spectrum. The Bernstein-Durrmeyer operator [7| is defined on by 

(n + dy. 


Dnif) = 


|Q!|=n 


(4.3) 


(4.4) 


This has a structure similar to a discrete Fourier series, although the Bernstein polynomials are 
orthogonal. The original Bernstein operator [23] has the form of a Lagrange interpolant, although 
the basis is not interpolatory. 

For i > 1, we let Qi denote the space of d-variate polynomials of degree i that are orthogonal 
to all polynomials of degree f — 1 on the simplex. The following result is given in [7], and also 
referenced in m to generate the B-form of simplicial orthogonal polynomials. 

Theorem 4.1 (Derriennic). For each 0 <i <n, each 


— 


(n + dy.nl 


(n + i + d)l {n — i)\ 


is an eigenvalue of corresponding to the eigenspace Qi. 

This gives a sequence of eigenvalues Aq,™ > ^i,n > ■ ■ ■ > Xn,n > 0, each corresponding to 
polynomial eigenfunctions of increasing degree. 

Up to scaling, the Bernstein-Durrmeyer operator restricted to polynomials exactly corre¬ 
sponds to the action of the mass matrix. To see this, suppose that 9 p = X)|a|=n Then 


n! 


dy 


Dnip)= E 


\a\=r, 


= E E kBIb: 


b: 


\a\=r, 


(4.5) 


^|/ 3 |=n 

= E T.pi>W.b2)b: 

|Q;|=n |^|=n 

= E f E "J-ow I 

\a\=n \|/ 3 |=n / 

This shows that the coefficients of the B-form of Dn{p) are just the entries of the Bernstein 
mass matrix times the coefficients of p. Consequently, 

Theorem 4.2. For each 0 < i < n, each 


Aj_ rj. 


dy.{r 


{n + i + dy. {n — iy. 
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is an eigenvalue of of multiplicity of , and the eigenspace is spanned by the B-form of 

any basis for Qt. 

This also implies that the Bernstein mass matrices are quite ill-conditioned in the two norm, 
using the characterization in terms of extremal eigenvalues for SPD matrices. 

Corollary 4.3. The 2-norm condition number of is 

^o,n _ (2n -|- d)l . , 

A„,„ {n-\-d)\n\ 


However, the spread in eigenvalues does not tell the whole story. We have exactly n-l-1 distinct 
eigenvalues, independent of the spatial dimension. This shows significant clustering of eigenvalues 
when d > 1. 

Corollary 4.4. In exact arithmetic, unpreconditioned conjugate gradient iteration will solve 
a linear system of the form = y in exactly n-l-1 iterations, independent of d. 

If the fast matrix-vector algorithms in [HUH] are used to compute the matrix-vector product, 
this gives a total operation count of Interestingly, this ties the per-element cost of Cholesky 

factorization when d = 2, but without the startup or storage cost. It even beats a pre-factored 
matrix when d > 2, but still loses asymptotically to the cost of evaluating F(u). However, in light 
of the large condition number given by Corollary |4.3[ it is doubtful whether this iteration count 
can be realized in actual floating point arithmetic. 

The high condition number also suggests an additional source of error beyond discretization 
error. Suppose that we commit an error of order e in solving Mx = y, computing instead some x 
such that \\x — x|| = e in the oo norm. Let u and u be the polynomial with B-form coefficients x 
and X, respectively. Because a polynomial in B-form lies in the convex hull of its control points [23] . 
we also know that u and u differ by at most this same e in the max-norm. Consequently, the 
roundoff error in mass inversion can conceivably pollute the finite element approximation at high 
order, although ten-digit accuracy, say, will still only give a maximum of 10“ additional pointwise 
error in the finite element solution - typically well below discretization error. 


4.2. Block structure and a fast solution algorithm. Here, we recall several facts proved 
in [IH] related to the block structure of which we will apply now for solving square systems. 

We consider partitioning the mass matrix formula (4.3) by freezing the first entry in a and /?. 
Since there are m -I- 1 possible values for for oq and n 1 for /3o, this partitions into an 

(to -I- 1) X (n -I-1) array, with blocks of varying size. In fact, each block is x P^Zg^- 

These blocks are themselves, up to scaling, Bernstein mass matrices of lower dimension. In 
particular, we showed that 


Q:0,P0 


)a) 


Kapt XPo 




d) 


-M' 


d—l,m—cx.Q,n—(3Q 


(4.7) 


We introduce the (to-|- 1) x (n-l-1) array consisting of the scalars multiplying the lower-dimensional 
mass matrices as 


d.m.n 
y Q 
OiQ.PO 


(:;)(?.) 




(4.8) 
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SO that satisfies the block structure, with superscripts on v terms dropped for clarity 








(4.9) 




We partition the right-hand side and solution vectors y and x conformally to M, so that the 
block yj is of dimension P^Z.^ and corresponds to a polynomial’s i?-form coefficients with first 
indices equal to j. We write the linear system in an augmented block matrix as 


(4.10) 


From [18) . we also know that mass matrices of the same dimension but differing degrees are 
related via degree elevation operators by 


/ z/o.oM"*-!-"’’" 


j/Q 

2/0 \ 



1 

1 

2/1 



1/ 

Vn / 


M 


d.m—1 


= {E'^ 


and 


Iteratively, these results give 


for 1 < i < m and 


M 




= {E 





(4.11) 

,n ^d,n 

(4.12) 


(4.13) 

^d,n—j,n 

(4.14) 


for 1 < j < n. In |18) . we used these features to provide a fast algorithm for matrix multiplication, 
but here we use them to efficiently solve linear systems. 

Carrying out blockwise Gaussian elimination in (4.10), we multiply the first row, labeled with 
0 rather than 1, by ^ subtract from row 1 to introduce a zero block 

below the diagonal. However, this simplifies, as (|4.11 ) tells us that 


M 


d—l,n—l,‘ 


{M 


d—lz 


") = m' 


fd—l, 


{M 


d—1,' 


= {E 


d—1,', 


(4.15) 


Because of this, along row 1 for j > 1, the elimination step gives entries of the form 


vijM' 




^'OO 




but (4.11) renders this as simply 




= 


^10 \ (I—n—j 

^00 


(4.16) 
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That is, the row obtained by block Gaussian elimination is the same as one would obtain simply 
by performing a step of Gaussian elimination on the matrix of coefficients containing the 

V values above, as the matrices those coefficients scale do not change under the row operations. 
Hence, performing elimination on the (n + 1) x (n + 1) matrix, independent of the dimension d, 
forms a critical step in the elimination process. After the block upper triangularization, we arrive 
at a system of the form 





yo \ 

0 



yi 

V 0 

0 


yn / 


where the tildes denote that quantities updated through elimination. The backward substition 
proceeds along similar lines, though it requires the solution of linear systems with mass matrices in 
dimension d—\. Multiplying through each block row by then gives, using (4.14) 


0 ’ I 

Vo 0 

where the primes denote quantities updated in the process. We reflect this in the updated N matrix 
by scaling each row by its diagonal entry as we proceed. At this point, the last block of the solution 
is revealed, and can be successively elevated, scaled, and subtracted from the right-hand side to 
eliminate it from previous blocks. This reveals the next-to last block, and so-on. We summarize 
this discussion in Algorithm |4. 2 [ 

Since we will need to solve many linear systems with the same element mass matrix, it makes 
sense to extend our elimination algorithm into a reusable factorization. We will derive a block- 
wise LDL^ factorization of the element matrix, very much along the lines of the standard factor- 
izatin [28] . 

Let A^'^’" be the matrix of coefficients given in 
ization 

Afd.n ^ 2.^1”*, (4.19) 

with iij and da the entries of and respectively. We also define UVr" = 

Uij — dii£ji 

Then, we can use the block matrix 



/ 

I 

0 . 

• 



-£l0 

I . 

. 0 

II 

0 



0 . 

. 0 


V 


0 . 

■ l) 


(4.8). Suppose that we have its LDL^ factor- 


77 / TT’d—l,0,n 
^O.n^ 

77^/ pci—l,0,n—1 
^1,71-^ 


y'o 

y[ 


y'n / 


(4.18) 
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Algorithm 1 Block-wise Gaussian elimination for solving = y 

Require: Input vector y 

Ensure: On output, y is overwritten with 

Initialize coefficient matrix Na^b '■= 
for a := 0 to n do {Forward elimination} 

Z^Va 

for & := a -I- 1 to n do 

2 ^ (E^-hd-b+l'^T^ 

Vb^Vb- J^z 

for c := a to n do {Elimination on N} 

AT / AT 

T^b,c ^ b\b^c -— 

end for 
end for 
end for 

for a := 0 to n do {Lower-dimensional inversion} 


Va ^ {M 

for & := a to n do 


— l,n—a,n—a\ 


Va 


Nb 


'^b,a ^ ^ 

end for 
end for 

for a := n to 0 do {Backward elimination} 
Z^ya 

for & := a — 1 to 0 do 

Ub^Vb- Nb,aZ 

end for 
end for 


Nb,u 


to act on to produce zeros below the diagonal in the first block of columns. Similarly, we act 

on with 


Li 


0 

0 


0 

/ 


-hi (E 


d—1,71—2,n— 1 


T 


o\ 

0 

0 


Vo (E'i-bO.n-l)^ ... ij 


to introduce zeros below the diagonal in the second block of columns. Indeed, we have a sequence 




Efficient DGFEM via Bernstein polynomials 


13 


of block matrices for 0 < A: < n such that Lf, is } x ] with 



I 

0 

0 


{E~ 




for i = j 

for i ^ j and j ^ k 
for i < j and j = k 
for i > j and j = k 


Then, in fact, we have that 




0 


V 0 




Much as with elementary row matrices for classic LU factorization, we can invert each of these 
E matrices simply by flipping the sign of the multiplier, so that 



I 

0 

0 


{E' 




for i = j 

for i ^ j and j ^ k 
for i < j and j = k 
for i > j and j = k 


Then, we define to be the inverse of these products 

pd^r, ^ (^n-lJn-2 ^ = (Z°'^ \ ^ ( 4 . 20 ) 

so that ^ is block upper triangular. Like standard factorization, we can also 

multiply the elimination matrices together so that 


(^d.n) 1 


/ I 

-ho 

-ho 


0 

I 


-hi {Pd 


— l,n—2,n—1 


t 


o\ 

0 

0 


(4.21) 


V -ho {Ed-^^^^r.y 


-hi {E' 


d-l,0,n-l\ 


ij 


Moreover, we can turn the block upper triangular matrix into a block diagonal one times the 
transpose of giving a kind of block LDL^ factorization. We factor out the pivot blocks from 
each row, using (4.15) so that 


jjd,n _ 


/dooMd-^’^’^ 

0 

0 \ 

1 

0 

^ll^d-l.n-l.n-1 ^ 

0 


V 0 

0 

. dnuMd-^Eo) 



. . . hlPd-^E 


VO 


The factor on the right is just 
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We introduce the block-diagonal matrix A'^’" by 


An = diiM 


d—l,n—i 


(4.22) 


Our discussion has established: 

Theorem 4.5. The Bernstein mass matrix admits the block factorization 

j\^d,n _ j^d^n^d,n 


(4.23) 


We can apply the decomposition inductively do wn s patial dimension, so that each of the blocks 
in ^d,n 

can be also factored according to Theorem |4.5[ This fully expresses any mass matrix as a 


diagonal matrix sandwiched in between sequences of sparse unit triangular matrices. 

So, computing the factorization of requires computing the LDL^ factorization 

of the one-dimensional coefficient matrix Supposing we use standard direct method such as 

Cholesky factorization to solve the one-dimensional mass matrices in the base case, we will have 
a start-up cost of factoring n -|- 1 matrices of size no larger than n -I- 1. With Cholesky, this is a 
0{n^) process, although since the one-dimensional matrices factor into into Hankel matrices pre- 
and post-multiplied by diagonal matrices, one could use Levinson’s or Bareiss’ algorithm [am] to 
obtain a merely 0{n^) startup phase. 


Algorithm 2 Mass inversion via block-recursive LDLr’' factorization for d > 2 
Require: factored as A"’*’" = LDLF 

Require: Input vector y 
Ensure: On output, x = 

Initialize vector a: ■<— 0 

for a := 0 to n do {Apply (L'’*’")”^ to y, store in x} 

Z^Va 

for & := a -I- 1 to n do 

Xb^ Xb- Lb^aZ 

end for 
end for 

for a := 0 to n do {Overwrite x with (A‘’*’")“^a:} 

end for 

for a := n to 0 do {Overwrite x with (L"’'^)“^x} 
z Xa 

for & := a — 1 to 0 do 

Xb ^ Xb Tb^aZ 

end for 
end for 


Now, we also consider the cost of solving a linear system using the block factorization, pseu¬ 
docode for which is presented in Algorithm |4.2| In two dimensions, one must apply the inverse 
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of followed by the inverse of accomplished by triangular solves using pre-factored one¬ 
dimensional mass matrices, and the inverse of In fact, the action of applying 

requires exactly the same process as described above for block Gaussian elimination, except the 
arithmetic on the v values is handled in preprocessing. That is, for each block we will need 
to compute for 1 < i < n — j — 1 and accumulate scalings of these vectors into 

corresponding blocks of the result. Since these elevations are needed for each it is helpful to reuse 
these results. Applying then requires applying for all valid i and j, together with 

all of the axpy operations. Since the one-dimensional elevation into degree i has 2{i + 1) nonzeros 
in it, the required elevations required cost 

1 : 1 : 20 + l) = d!l±^C^, (4.24) 

i = l i = l 


operations, which is 0(n^), and we also have a comparable number of operations for the axpy-like 
operations to accumulate the result. A similar discussion shows that applying requires 

the same number of operations. Between these stages, one must invert the lower-dimensional 
mass matrices using the pre-computed Cholesky factorizations and perform the scalings to apply 
A“^. Since a pair of to x m triangular solves costs m{m + 1) operations, the total cost of the 
one-dimensional mass inversions is 

V(* -f 1)(* + 2) = (^+l)(^ + 2)(n + 3) ^ 

O 

2 = 0 

together with the lower-order term for scalings 

y:p‘=5:(i+i)=i^i±A^, 

2=0 2=0 


So, the whole three-stage process is 0{n^) per element. 

In dimension d > 2, we may proceed inductively in space dimension to show that Algorithm |4.2| 
requires, after start-up, operations. The application of A“^ will always require n + 1 

inversions of {d — l)-dimensional mass matrices , each of which costs operations by the 

induction hypothesis. Inverting A*^’" onto a vector will cost operations for all n and d. 

To see that a similar complexity holds for applying the inverses of and its transpose, one can 


simply replace the summand in (4.24) with 2P^ ^ and execute the sum. To conclude, 
Theorem 4.6. Algorithm 4-2 applies the inverse of to an arbitrary vector in 

operations. 


5. Numerical results. 


5.1. Mass inversion. Because of Corollary |4.3i we must pay special attention to the accuracy 
with which linear systems involving the mass matrix are computed. We began with Cholesky 
decomposition as a baseline. For degrees one through twenty in one, two, and three space dimension, 
we explicitly formed the reference mass matrix in Python and used the scipy |I6j interface LAPACK 
to form the Cholesky decomposition. Then, we chose several random vectors to be sample solutions 
and formed the right-hand side by direct matrix-vector multiplication. In Figure [5T| we plot the 
relative accuracy of a function of degree in each space dimension. Although we observe expontial 
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Fig. 5.1. Relative accuracy of solving linear systems with mass matrices of various degrees using Cholesky 
decomposition. 


growth in the error (fully expected in light of Corollary |4.3[ ), we see that we still obtain at least ten 
digits of relative accuracy up to degree ten. 

Second, we also attempt to solve the linear system using conjugate gradients. We again used 
systems with random solution, and both letting CG run to a relative residual tolerance of 10“^^ 



Finally, our block algorithm gives accuracy comparable to that of Cholesky factorization. 
Our two-dimensional implementation of Algorithm |4.2| uses Cholesky factorizations of the one¬ 
dimensional mass matrices. Rather than full recursion, our three-dimensional implementation uses 
Cholesky factorization of the two-dimensional matrices. At any rate. Figure 5.4 shows, when 
compared to Figure |5.H that we lose very little additional accuracy over Cholesky factorization. 


Whether replacing the one-dimensional solver with a specialized method for totally positive matri¬ 
ces [22] would also give high accuracy for the higher-dimensional problems will be the subject of 
future investigation. 


5.2. Timing for first-order acoustics. We fixed a 32 x 32 square mesh subdivided into right 
triangles and computed the time to perform the DC function evaluation (including mass matrix 
inversion) at various polynomial degrees. We used the mesh from DOLFIN [2S| and wrote the 
Bernstein polynomial algorithms in Cython With an 0{v?) complexity for two-dimensional 
problems, we expect a doubling of the polynomial degree to produce an eightfold increase in run¬ 
time. In Figure [O] though, we see even better results. In fact, a least-squares fit of the log-log 
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,,0 Accuracy of CG with fixed tolerance for mass inversion 


Polynomial degree 


(a) Accuracy obtained by iterating until a resid¬ 
ual tolerance of 10“^^. 


CG Iterations to reach fixed tolerance 


Polynomial degree 


(b) CG iterations required to solve = y 

to a tolerance of 


Fig. 5.2. Accuracy obtained solving mass matrix system using conjugate gradient iteration in one, two, and 
three space dimensions. 


Accuracy of CG with fixed iterations for mass inversion 



Fig. 5.3. Relative accuracy of solving = y using exactly n + 1 CG iterations. 


data in this table from degrees five to fifteen gives a very near fit with a slope of less than two 
(about 1.7) rather than three. Since small calculations tend to run at lower flop rates, it is possible 
that we are far from the asymptotical regime predicted by our operation counts. 

6. Conclusions and Future Work. Bernstein polynomials admit optimal-complexity algo¬ 
rithms for discontinuous Galerkin methods for conservation laws. The dense element mass matrices 
might, at first blush, seem to prevent this, but their dimensionally recursive block structure and 
other interesting properties, lead to an efficient blockwise factorzation. Despite the large condi- 
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Fig. 
the block 


5.4. Relative accuracy of solving linear systems with mass matrices of various degrees using one level of 
algorithm with Cholesky factorization for lower-dimensional matrices. 


tion numbers, our current algorithms seem sufficient to deliver reasonable accuracy at moderate 
polynomial orders. 

On the other hand, these results still leave much room for future investigation. First, it makes 
sense to explore the possibilities of slope limiting in the Bernstein basis. Second, while our mass 
inversion algorithm is sufficient for moderate order, it may be possible to construct a different algo¬ 
rithm that maintains the low complexity while giving higher relative accuracy, enabling very high 
approximation orders. Perhaps such algorithms will either utilize the techniques in |22j internally, 
or else extend them somehow. Finally, our new algorithm, while of optimal compexity, is quite 
intricate to implement and still is not well-tuned for high performance. Finding ways to make these 
algorithms more performant will have important practical benehts. 
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